Now let's play with databases. MySQL and SQLite databases are accessed in a very similar fashion in Python.
Let's install the different packages
pip install MySQL-python
pip install sqlalchemy
pip install ipython-sql
pip install dbtools
SQLite and MySQL have very similar features and query language, but also some subtle differences. The main difference is that MySQL requires a running server to host the data, while SQLite stores the data in a local file. Some pros and cons:
SQLite: serverless and zero-configuration, the whole database lives in a single file, but it is less suited to many concurrent writers.
MySQL: requires a server and some administration, but handles concurrent multi-user access over the network.
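For instance, a complete SQLite session needs nothing but the standard-library sqlite3 module. A minimal sketch (the table and column names here are made up for illustration):

```python
import sqlite3

# An in-memory database for illustration; pass a filename such as 'demo.db'
# to persist the data to a local file instead.
db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('CREATE TABLE demo (name TEXT, value REAL)')
c.execute('INSERT INTO demo VALUES (?, ?)', ('alpha', 1.5))
db.commit()
c.execute('SELECT * FROM demo')
rows = c.fetchall()
db.close()
print(rows)
```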
Start with SQLite; you can then test out MySQL by setting:
sql_type = 'mysql'
In [1]:
sql_type = 'mysql'
In [2]:
user_name = "pire"
In [3]:
tbname = "py4we_" + user_name
Importing the logit magic for benchmarking the different methods
In [4]:
%load_ext logit
Let's load the data previously saved in json format
In [5]:
import json
with open('tmp.json','r') as f:
    json_dic = json.load(f)
json_dic.keys()
Out[5]:
In [6]:
#first replace the dates by datetime formats
from datetime import datetime
from numpy import array
#fmt = "%Y%m%d %H:%M:%S"
#json_dic['dates'] = [datetime.strptime(r, fmt) for r in json_dic['dates']]
#Then build a numpy array containing all the information
db_data = array([json_dic[c] for c in ['dates', 'weekdays', 'latitude', 'longitude', 'orientation']]).T
db_data[:10]
Out[6]:
The sql magic is the easiest way to use SQL, but it only works within IPython notebooks, and it has some limitations.
First you need to load the sql magic.
In [5]:
%load_ext sql
It works both for SQLite and MySQL (and PostgreSQL, if you are into that sort of thing).
In [8]:
if sql_type == 'sqlite':
    ## Connect
    %sql sqlite:///mysqlite.db
    # Print out what the tables are
    res = %sql SELECT * FROM sqlite_master WHERE type='table'
elif sql_type == 'mysql':
    ## Connect
    %sql mysql://student:DTU63Course@10.40.20.14/coursedb
    ## Print out what the tables are
    res = %sql SHOW TABLES
res
Out[8]:
Delete the table
In [9]:
%sql DROP TABLE test_sql
Out[9]:
Create the Table
In [10]:
%%sql
CREATE TABLE test_sql
(
    dates VARCHAR(20),
    weekdays VARCHAR(10),
    latitude DOUBLE,
    longitude DOUBLE,
    orientation DOUBLE
);
Out[10]:
It's possible to insert the data into the database line by line, but this is painfully slow for large datasets.
Notice how the Python variables are injected into the SQL query by prefixing them with ":". It's a bit tricky, because the variable name is replaced by its value wrapped in quotes ''. Unfortunately there is a small bug when you try to use this to set the table name: :table_name is replaced by the quoted string 'table_name', which is not a valid SQL query.
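The same limitation exists with plain sqlite3 placeholders: parameter substitution quotes values as string literals, so it works for column values but not for identifiers such as table names. A minimal sketch (with a made-up table):

```python
import sqlite3

db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('CREATE TABLE test_sql (dates TEXT, orientation REAL)')

# Placeholders are fine for values...
c.execute('INSERT INTO test_sql VALUES (?, ?)', ('20130101', 42.0))

# ...but not for identifiers: a parameter cannot stand in for a table name
try:
    c.execute('SELECT * FROM ?', ('test_sql',))
except sqlite3.OperationalError as e:
    print('Failed as expected: %s' % e)

# The table name has to be built into the query string instead
c.execute('SELECT * FROM ' + 'test_sql')
rows = c.fetchall()
print(rows)
```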
In [11]:
N = 20
for date, weekd, latitude, longitude, orientation in db_data[:N]:
    res = %sql INSERT INTO test_sql VALUES (:date, :weekd, :latitude, :longitude, :orientation);
%sql SELECT * FROM test_sql
Out[11]:
It's easy to obtain a pandas.DataFrame from the query result
In [12]:
results = %sql SELECT * FROM test_sql
pres = results.DataFrame()
plot(pres.latitude, pres.longitude, '.')
Out[12]:
Using the SQLite3 and MySQLdb modules
In [13]:
### You can select here which method to use: sqlite or mysql
if sql_type == 'sqlite':
    import sqlite3
    db = sqlite3.connect('mysqlite.db')
elif sql_type == 'mysql':
    import MySQLdb
    db = MySQLdb.connect(host="10.40.20.14", user="student", passwd="DTU63Course", db="coursedb")
c = db.cursor()
(Drop and) Create the table
In [14]:
try:
    c.execute("DROP TABLE "+tbname)
except:
    print tbname+' doesn\'t exist yet'
c.execute("""
CREATE TABLE %s (
    dates VARCHAR(20),
    weekdays VARCHAR(10),
    latitude DOUBLE,
    longitude DOUBLE,
    orientation DOUBLE
)"""%(tbname))
Out[14]:
Insert all the data and read it
In [15]:
%%logit executemany, mysqlite.db, sql_type
# Insert all the data from db_data
if sql_type == 'sqlite':
    query_insert = 'INSERT INTO '+tbname+' VALUES (?,?,?,?,?)'
elif sql_type == 'mysql':
    query_insert = 'INSERT INTO '+tbname+' VALUES (%s,%s,%s,%s,%s)'
## Notice the executemany instead of execute: it's much faster when inserting data.
c.executemany(query_insert, db_data.tolist())
# Don't forget to commit if you want to save your modifications to the database
db.commit()
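To get a feel for the speed-up of executemany over a per-row execute loop, here is a small timing sketch on an in-memory SQLite database with made-up data (absolute timings will vary between machines):

```python
import sqlite3
import time

# Made-up data: 20,000 rows of (name, value) pairs
rows = [('row%d' % i, float(i)) for i in range(20000)]

def time_insert(use_executemany):
    """Insert all rows and return the elapsed time in seconds."""
    db = sqlite3.connect(':memory:')
    c = db.cursor()
    c.execute('CREATE TABLE t (name TEXT, value REAL)')
    t0 = time.time()
    if use_executemany:
        c.executemany('INSERT INTO t VALUES (?, ?)', rows)
    else:
        for row in rows:
            c.execute('INSERT INTO t VALUES (?, ?)', row)
    db.commit()
    elapsed = time.time() - t0
    db.close()
    return elapsed

print('execute loop : %.3fs' % time_insert(False))
print('executemany  : %.3fs' % time_insert(True))
```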
In [16]:
# Read all the data back
c.execute('SELECT * FROM '+tbname)
#Print the first ten rows
for row in c.fetchall()[:10]:
    print row
# Don't forget to close the database
c.close()
Pandas has a SQL interface that can use the sqlite3 or MySQLdb connectors.
In [17]:
import pandas as pd
import pandas.io.sql as psql
df1 = psql.frame_query('SELECT * FROM '+tbname, db)
# Indicate that the index is of date format
df1.index = pd.DatetimeIndex(df1.dates)
df1.head()
Out[17]:
In [18]:
df1.plot()
Out[18]:
You can also use pandas directly to write to your database
In [19]:
%%logit pandas_write_frame, mysqlite.db, sql_type
if psql.table_exists(tbname, db, sql_type):
    psql.execute('DROP TABLE '+tbname, db)
psql.write_frame(df1, tbname, db, flavor=sql_type)
Let's time it using sqlite
And reading again from it
In [20]:
psql.frame_query('SELECT * FROM '+tbname, db).head()
Out[20]:
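Note that in newer pandas versions frame_query and write_frame have been replaced by read_sql and to_sql. A minimal sketch of the equivalent calls, using an in-memory SQLite database and made-up data:

```python
import sqlite3
import pandas as pd

db = sqlite3.connect(':memory:')
df = pd.DataFrame({'latitude': [55.7, 55.8], 'longitude': [12.5, 12.6]})

# to_sql replaces write_frame; read_sql replaces frame_query
df.to_sql('positions', db, if_exists='replace', index=False)
back = pd.read_sql('SELECT * FROM positions', db)
print(back)
```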
dbtools is a tool to access SQLite data files. It is inspired by SQLAlchemy, but is much simpler to use (though also less powerful and less stable).
In [21]:
from dbtools import Table
try:
    Table("mysqlite_dtools.db", tbname).drop()
except:
    pass
tbl = Table.create("mysqlite_dtools.db", tbname,
                   [('dates', str),
                    ('weekdays', str),
                    ('latitude', float),
                    ('longitude', float),
                    ('orientation', float)])
tbl.columns
Out[21]:
Once it's created, you can just use it this way. It's very slow, so we reduce it to the first 10,000 entries.
In [22]:
%%logit dbtools_insertlist, mysqlite_dtools.db, sqlite
tbl = Table("mysqlite_dtools.db", tbname)
tbl.insert(db_data[:10000])
The select() function directly returns a pandas DataFrame.
In [23]:
df2 = tbl.select()
df2.plot()
Out[23]:
The select() function can also express a SELECT query:
In [24]:
### This is equivalent to this query:
### SELECT dates, orientation FROM py4we_tb WHERE weekdays="Thursday";
### or df3 = df2[df2.weekdays=='Thursday']
df3 = tbl.select(['orientation', 'latitude'], where='weekdays="Thursday"')
df3.plot()
df3.head()
Out[24]:
It's possible to use a pandas.DataFrame as input to create a new table, but it's even slower:
In [25]:
%%logit dbtools_insertpandas, mysqlite_dtools2.db, sqlite
try:
    Table("mysqlite_dtools.db", tbname).drop()
except:
    print tbname, "doesn't exist"
Table.create("mysqlite_dtools.db", tbname, df2)
Unfortunately the module is still in its infancy, and can't (yet?) work with sqlite files that haven't been created by it:
In [26]:
df5 = Table("mysqlite.db", tbname).select()
In [27]:
from logit import plot_stats
df = plot_stats()
You are now going to apply your knowledge by uploading the results from the previous exercises to a MySQL database. You should upload the results logged in the JSON file to the py4we_results table.
The py4we_results table is organised as follows (see the DESCRIBE output below):
In [36]:
%sql mysql://student:DTU63Course@10.40.20.14/coursedb
%sql DESCRIBE py4we_exercise
Out[36]:
In [48]:
%sql SELECT * FROM py4we_exercise LIMIT 10
Out[48]:
In [1]:
import sqlite3
import pandas as pd
import pandas.io.sql as psql
import MySQLdb
dbmysql = MySQLdb.connect(host="10.40.20.14", user="student", passwd="DTU63Course", db="coursedb")
df = pd.io.sql.frame_query('SELECT * FROM py4we_exercise', dbmysql)
dblite = sqlite3.connect('sqlite_backup.db')
pd.io.sql.write_frame(df, 'py4we_exercise', dblite, flavor='sqlite', if_exists='append')
In [ ]:
import sqlite3
dblite = sqlite3.connect('sqlite_logger.db')
df = pd.io.sql.frame_query('SELECT * FROM logger', dblite)
dbmysql = MySQLdb.connect(host="10.40.20.14", user="student", passwd="DTU63Course", db="coursedb")
pd.io.sql.write_frame(df, 'py4we_exercise', dbmysql, flavor='mysql', if_exists='append')
In [2]:
df.head()
Out[2]:
Plotting the benchmarks
This line computes the mean time for each (userid, testid) pair, pivots the result so that tests become rows and users become columns, and plots it as horizontal bars:
In [32]:
df.groupby(['userid', 'testid'])['time'].mean().reset_index() \
.pivot(index='testid', columns='userid', values='time').plot(kind='barh')
Out[32]:
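On a small made-up DataFrame the same groupby/pivot chain looks like this:

```python
import pandas as pd

# Hypothetical benchmark log: two users, two tests, repeated runs
df = pd.DataFrame({
    'userid': ['a', 'a', 'b', 'b', 'a'],
    'testid': ['t1', 't2', 't1', 't2', 't1'],
    'time':   [1.0,  2.0,  3.0,  4.0,  5.0],
})

# Mean time per (userid, testid) pair, then pivot: tests as rows, users as columns
pivoted = df.groupby(['userid', 'testid'])['time'].mean().reset_index() \
            .pivot(index='testid', columns='userid', values='time')
print(pivoted)
```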
In [46]:
#sql DROP TABLE py4we_exercise
In [ ]:
##sql
CREATE TABLE py4we_exercise (
    userid VARCHAR(10),
    date VARCHAR(20),
    testid VARCHAR(50),
    filesize DOUBLE,
    time DOUBLE
);